Natural Language Engineering Techniques
Lecture 3: Text technologies. Pre-syntactic and syntactic processing
Lectures: Dan Cristea
Labs: Diana Trandabăț, Mihaela Onofrei, Daniela Gîfu, Ionuț Pistol

The sub-syntactic layer
[Figure: processing pipeline: text => INITIAL PROCESSING => SUB-SYNTACTIC PROCESSING => SYNTACTIC PROCESSING => SEMANTIC PROCESSING => DISCOURSE PROCESSING => PRAGMATIC PROCESSING => result. Sub-syntactic steps: recognize sentence borders, tokenization, POS tagging, lemmas, NP chunking]

Part-of-speech (POS) tagging
•POS tagging is the process of assigning a part-of-speech or lexical class marker to each word in a corpus (Jurafsky and Martin)
[Figure: the WORDS "The couple spent the honeymoon on a yacht" are mapped to TAGS such as N, V, P, DET]

POS Tagger Prerequisites
•Lexicon of words
•For each word in the lexicon: information about all its possible tags according to a chosen tagset
•Different methods for choosing the correct tag for a word:
–Rule-based methods
–Statistical methods
–Transformation-based tagging
–Neural methods

POS Tagger Prerequisites: Lexicon of words
•Classes of words
–Closed classes: a fixed set
•Prepositions: in, by, at, of, …
•Pronouns: I, you, he, her, them, …
•Particles: on, off, …
•Determiners: the, a, an, …
•Conjunctions: or, and, but, …
•Auxiliary verbs: can, may, should, …
•Numerals: one, two, three, …
–Open classes: new words can be coined at any time, so a lexicon can never contain all the words of these classes
•Nouns
•Verbs
•Adjectives
•Adverbs

POS Tagger Prerequisites: Tagsets
•POS tagging requires choosing a standard set of tags to work with
•A tagset is normally sophisticated and linguistically well grounded
•One could pick very coarse tagsets
–N, V, Adj, Adv
•More commonly used sets are finer grained, e.g. the "UPenn TreeBank tagset", 48 tags
•Even more fine-grained tagsets exist

POS Tagger Prerequisites: Tagset example (UPenn tagset)
No.  Tag   Description                           No.  Tag   Description
 1   CC    Coordinating conjunction              25   TO    to
 2   CD    Cardinal number                       26   UH    Interjection
 3   DT    Determiner                            27   VB    Verb, base form
 4   EX    Existential there                     28   VBD   Verb, past tense
 5   FW    Foreign word                          29   VBG   Verb, gerund/present participle
 6   IN    Preposition/subord. conjunction       30   VBN   Verb, past participle
 7   JJ    Adjective                             31   VBP   Verb, non-3rd ps. sing. present
 8   JJR   Adjective, comparative                32   VBZ   Verb, 3rd ps. sing. present
 9   JJS   Adjective, superlative                33   WDT   wh-determiner
10   LS    List item marker                      34   WP    wh-pronoun
11   MD    Modal                                 35   WP$   Possessive wh-pronoun
12   NN    Noun, singular or mass                36   WRB   wh-adverb
13   NNS   Noun, plural                          37   #     Pound sign
14   NNP   Proper noun, singular                 38   $     Dollar sign
15   NNPS  Proper noun, plural                   39   .     Sentence-final punctuation
16   PDT   Predeterminer                         40   ,     Comma
17   POS   Possessive ending                     41   :     Colon, semi-colon
18   PRP   Personal pronoun                      42   (     Left bracket character
19   PRP$  Possessive pronoun                    43   )     Right bracket character
20   RB    Adverb                                44   "     Straight double quote
21   RBR   Adverb, comparative                   45   `     Left open single quote
22   RBS   Adverb, superlative                   46   ``    Left open double quote
23   RP    Particle                              47   '     Right close single quote
24   SYM   Symbol (mathematical or scientific)   48   ''    Right close double quote

POS Tagging: Rule-based methods
•Start with a dictionary
•Assign all possible tags to words from the dictionary
•Write rules by hand to selectively remove tags
•The goal is to leave only the correct tag for each word

POS Tagging: Statistical methods (1)
The Most Frequent Tag Algorithm
•Training
–Take a tagged corpus
–Create a dictionary containing every word in the corpus together with all its possible tags
–Count the number of times each tag occurs for a word and compute the probability P(tag|word); then save all probabilities
•Tagging
–Given a new sentence, for each word, pick the tag that was most frequent for that word in the corpus

POS Tagging: Statistical methods (2)
Bigram HMM Tagger
•Training
–Create a dictionary containing every word in the corpus together with all its possible tags
–Compute, for each word, the probability of it being emitted by each tag, P(word|tag), and the probability that each tag is preceded by a specific tag, P(tag|previous tag) (bigram HMM tagger => the probability of a tag depends
only on the previous tag)
•Tagging
–Given a new sentence, for each word, pick the most likely tag for that word using the parameters obtained in training
–HMM taggers choose the tag sequence that maximizes the product, over all words, of P(word|tag) * P(tag|previous tag)

Bigram HMM Tagging: Example
People/NNS are/VBP expected/VBN to/TO queue/VB at/IN the/DT registry/NN
The/DT police/NN is/VBZ to/TO blame/VB for/IN the/DT queue/NN
•to/TO queue/???   the/DT queue/???
•=> $\max_k P(t_k \mid t_{k-1}) \cdot P(w_i \mid t_k)$
–with $w_i$ = the current word in the sequence, $t_k$ = one possible tag for the current word, $t_{k-1}$ = the tag of the previous word
•How do we compute $P(t_k \mid t_{k-1})$?
–$\mathrm{count}(t_{k-1}\, t_k) / \mathrm{count}(t_{k-1})$
•How do we compute $P(w_i \mid t_k)$?
–$\mathrm{count}(w_i, t_k) / \mathrm{count}(t_k)$
•max[P(VB|TO) * P(queue|VB), P(NN|TO) * P(queue|NN)]
•Corpus:
–P(NN|TO) * P(queue|NN) = 0.021 * 0.00041 ≈ 0.0000086
–P(VB|TO) * P(queue|VB) = 0.34 * 0.00003 ≈ 0.00001
–=> the VB reading scores higher, so queue is tagged VB after to/TO

POS Tagging: Transformation Based Tagging (1)
•Combination of rule-based and stochastic tagging methodologies
–Like the rule-based approach because rule templates are used to learn transformations
–Like the stochastic approach because machine learning is used, with a tagged corpus as input
•Input:
–tagged corpus
–lexicon (with all possible tags for each word)

POS Tagging: Transformation Based Tagging (2)
•Basic idea:
–Set the most probable tag for each word as a start value
–Change tags, in a specific order, according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun"
•Training is done on a tagged corpus:
1 Write a set of rule templates
2 Among the rules instantiated from the templates, find the one with the highest score
3 Repeat from step 2 until the best score falls below a threshold
4 Keep the ordered set of rules
•Rules make errors that are corrected by later rules

Transformation Based Tagging: Example
•Tag every word with its most likely tag
–For example, race has the following probabilities in the Brown corpus:
•P(NN|race) = 0.98
•P(VB|race) = 0.02
•Transformation rules make changes to tags
–"Change NN to VB when previous tag is TO"
… is/VBZ expected/VBN to/TO
race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN

POS Taggers (1)
ACOPOST
Author(s): Jochen Hagenstroem, Kilian Foth, Ingo Schröder, Parantu Shah
Purpose: ACOPOST is a collection of POS taggers. It implements and extends well-known machine learning techniques and provides a uniform environment for testing.
Platforms: All POSIX (Linux/BSD/UNIX-like OSes)
Access: Free at http://sourceforge.net/projects/acopost/

BRILL'S TAGGER
Author(s): Eric Brill
Purpose: Transformation Based Learning POS Tagger
Access: Free at http://www.cs.jhu.edu/~brill

fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

POS Taggers (2)
LINGSOFT
Author(s): LINGSOFT, Finland
Purpose: Among the services offered by Lingsoft one can find POS taggers for Danish, English, German, Norwegian, Swedish.
Access: Not free. Demos at http://www.lingsoft.fi/demos.html

LT POS (LT TTT)
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: The LT POS part of speech tagger uses a Hidden Markov Model disambiguation strategy. It is currently trained only for English.
Access: Free but requires registration at http://www.ltg.ed.ac.uk/software/pos/index.html

MACHINESE PHRASE TAGGER
Author(s): Connexor
Purpose: Machinese Phrase Tagger is a set of program components that perform basic linguistic analysis tasks at very high speed and provide relevant information about words and concepts to volume-intensive applications. Available for: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/

POS Taggers (3)
MXPOST
Author(s): Adwait Ratnaparkhi
Purpose: MXPOST is a maximum entropy POS tagger. The downloadable version includes a Wall St. Journal tagging model for English, but can also be trained for different languages.
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

MEMORY BASED TAGGER
Author(s): ILK - Tilburg University, CNTS - University of Antwerp
Purpose: Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL.
Access: Usable by email or on the Web at http://ilk.uvt.nl/software.html#mbt

µ-TBL
Author(s): Torbjörn Lager
Purpose: The µ-TBL system is a powerful environment in which to experiment with transformation-based learning.
Platforms: Windows
Access: Free at http://www.ling.gu.se/~lager/mutbl.html

POS Taggers (4)
QTAG
Author(s): Oliver Mason, Birmingham University, UK
Purpose: QTag is a probabilistic parts-of-speech tagger. Resource files for English and German can be downloaded together with the tool.
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtag.html

STANFORD POS TAGGER
Author(s): Kristina Toutanova, Stanford University, USA
Purpose: The Stanford POS tagger is a log-linear tagger written in Java. The downloadable package includes components for command-line invocation and a Java API both for training and for running a trained tagger.
Platforms: Platform independent
Access: Free at http://nlp.stanford.edu/software/tagger.shtml

SVMTOOL
Author(s): TALP Research Center, University of Catalunya, Spain
Purpose: The SVMTool is a simple and effective part-of-speech tagger based on Support Vector Machines. The SVMLight software implementation of Vapnik's Support Vector Machine by Thorsten Joachims has been used to train the models for Catalan, English and Spanish.
Access: Free. SVMTool at http://www.lsi.upc.es/~nlp/SVMTool/ and SVMLight at http://svmlight.joachims.org/

POS Taggers (5)
TnT
Author(s): Thorsten Brants, Saarland University, Germany
Purpose: TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second order Markov models. TnT comes with two language models, one for German, and one for English.
Platforms: Platform independent
Access: Free but requires registration at http://www.coli.uni-saarland.de/~thorsten/tnt/

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used to tag German, English, French, Italian, Spanish, Greek and old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

POS Taggers (6)
Xerox XRCE MLTT Part Of Speech Taggers
Author(s): Xerox Research Centre Europe
Purpose: Xerox has developed morphological analysers and part-of-speech disambiguators for various languages including Dutch, English, French, German, Italian, Portuguese, Spanish. More recent developments include Czech, Hungarian, Polish and Russian.
Access: Not free. Demos at http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha is using Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system which performed the best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Stemming
•Stemmers are used in IR to reduce as many related words and word forms as possible to a common canonical form (not necessarily the base form), which can then be used in the retrieval process
•Frequently, the performance of an IR system will be improved if term groups such as CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS are conflated into a single term (by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT). The suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous

Stemmers (1)
ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

HEART OF GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese
Access: Free at http://heartofgold.dfki.de/

Stemmers (2)
LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech
Access: Not free. More information at http://www.petamem.com/

SNOWBALL
Purpose: Presentation of stemming algorithms, and Snowball stemmers, for English, Russian, Romance languages (French, Spanish, Portuguese and Italian), German, Dutch, Swedish, Norwegian, Danish and Finnish
Access: Free at http://www.snowball.tartarus.org/

SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: Available for: Unicode, Spanish, Japanese, German, French, English, Chinese
Access: Not free. More information at http://sprout.dfki.de/

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/

Lemmatization
•The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms COME, COMES, COMING, CAME under the base form COME
•Dictionary based
–Input: token + POS
–Output: lemma
•Note: needs POS information
•Example:
–left+v -> leave, left+a -> left
•It is the same as looking for a transformation to apply to a word to get its normalized form (word endings: what word suffix should be removed and/or added to get the normalized form) => lemmatization can be modeled as a machine learning problem

Lemmatizers (1)
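As a toy illustration of what a lemmatizer does, the dictionary-based lookup from the Lemmatization slide (token + POS in, lemma out) can be sketched as follows, with a suffix-rewrite fallback of the "remove and/or add a suffix" kind. This is only a sketch under stated assumptions: the dictionary entries beyond the slide's own examples, the suffix rules and the function name `lemmatize` are hypothetical, deliberately naive, and not taken from any tool in this section.

```python
# Toy exception dictionary keyed by (token, POS). The "left" entries come from
# the slide; the "came" entry is an illustrative assumption.
LEMMA_DICT = {
    ("left", "v"): "leave",
    ("left", "a"): "left",
    ("came", "v"): "come",
}

# Hypothetical suffix rules per POS: (suffix to remove, string to add).
# They are tuned to the COME example only and would be wrong for many verbs.
SUFFIX_RULES = {
    "v": [("ing", "e"), ("s", "")],
}


def lemmatize(token: str, pos: str) -> str:
    """Return the lemma for a token, given its part of speech."""
    token = token.lower()
    # 1. Dictionary lookup: the POS disambiguates forms such as "left".
    if (token, pos) in LEMMA_DICT:
        return LEMMA_DICT[(token, pos)]
    # 2. Fallback: apply the first matching suffix transformation.
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)] + replacement
    # 3. No rule applies: assume the token is already the base form.
    return token


if __name__ == "__main__":
    print(lemmatize("left", "v"), lemmatize("left", "a"))
    print([lemmatize(w, "v") for w in ["COME", "COMES", "COMING", "CAME"]])
```

In a learned lemmatizer the suffix rules would be induced from annotated (token, POS, lemma) triples rather than written by hand; the lookup order (exceptions first, transformations second) is a design choice of this sketch.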
CONNEXOR LANGUAGE ANALYSIS TOOLS
Author(s): Connexor, Finland
Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian, Finnish
Access: Not free. Demos at http://www.conexor.fi/

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

MBLEM
Author(s): ILK Research Group, Tilburg University
Purpose: MBLEM is a lemmatizer for English, German, and Dutch
Access: Demo at http://ilk.uvt.nl/mblem/

Lemmatizers (2)
SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used for German, English, French, Italian, Spanish, Greek and old French texts and is easily adaptable to other languages if a lexicon is available
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/

Shallow Parsing (chunking)
•Partition the input into sequences of non-overlapping units, or chunks: sequences of words labelled with syntactic categories and possibly a marking to indicate which word is the head of the chunk
•How?
–Set of regular expressions over POS labels
–Training the chunker on manually marked up text

Noun Phrase (NP) Chunkers
fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection, word sense disambiguation). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha is using Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is exactly the same system which performed the best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Named Entity Recognition
•Identification of proper names in texts, and their classification into a set of predefined categories of interest:
–entities: organizations, persons, locations
–temporal expressions: time, date
–quantities: monetary values, percentages, numbers
•Two kinds of approaches

Knowledge Engineering
•rule based
•developed by experienced language engineers
•make use of human intuition
•small amount of training data
•very time consuming
•some changes may be hard to accommodate

Learning Systems
•use statistics or other machine learning
•developers do not need LE expertise
•require large amounts of annotated training data
•some changes may require re-annotation of the entire training corpus

Named Entity Recognition: Knowledge engineering approach
•identification of named entities in two steps:
–recognition patterns expressed as WFSA (Weighted Finite-State Automaton) are used to identify phrases containing potential candidates for named entities (longest match strategy)
–additional constraints (depending on the type of candidate)
are used for validating the candidates
•usage of on-line base lexicon for geographical names, first names

Named Entity Recognition: Problems
•Variation of NEs, e.g. John Smith, Mr. Smith, John
•Since named entities may appear without designators (companies, persons), a dynamic lexicon for storing such named entities is used
Example: "Mars Ltd is a wholly-owned subsidiary of Food Manufacturing Ltd, a non-trading company registered in England. Mars is controlled by members of the Mars family."
•Resolution of type ambiguity using the dynamic lexicon: if an expression can be a person name or a company name (Martin Marietta Corp.), then use the type of the last entry inserted into the dynamic lexicon to make the decision
•Issues of style, structure, domain, genre, etc.
•Punctuation, spelling, spacing, formatting

Examples of entity mentions
•nested (imbricated) entities
clădirea Universității din Iași ("the building of the University of Iași") =>
[clădirea [Universității din [Iași]NP1,NE1]NP2,NE2]NP3

Examples of entity mentions
•juxtaposed entities
Western and Central Europe =>
[[Western/T1]NE1/T4 and/T2 [Central/T3 Europe/T4]NE2]NP1
(NE1, "Western … Europe", is completed by token T4)

Graphical Grammar Studio (GGS)
•Framework for the development and processing of grammars
•It incorporates a constraint description language:
–implementation of composite features
–look-ahead and look-behind assertions
–priority scores can be placed on arcs => forces a preference order in processing paths
•Designed with the main purpose of performing syntactic and sub-syntactic analysis

GGS networks
•Consume and annotate sequences of tokens or other XML elements. The input tokens can include any number of associated attributes (usually denoting POSs, lemmas, articles in the case of nouns and adjectives, token IDs, etc.)
•Structured as directed graphs: nodes express token consuming conditions, edges are directed:
–some nodes can make jumps to other sub-graphs
–networks are meant to be integrated into NLP chains
–a GGS network is basically a finite state machine whose nodes can be associated with
states

GGS networks
Bărăganul este cea mai mică câmpie ("Bărăganul is the smallest plain") => path 5 => Bărăganul

GGS networks
oceanele, de la mare la mic, sunt după cum urmează: Pacific, Atlantic, Indian, Antarctic și Arctic ("the oceans, from largest to smallest, are as follows: Pacific, Atlantic, Indian, Antarctic and Arctic") => Pacific, Atlantic, Indian, Antarctic, Arctic

Noun phrase detection
Dr. Radu Simionescu

MappingBooks: Let's jump out of the book into the real world!
•First place at the BringITon! 2016 exhibition of IT students' creations
•A UEFISCDI project that ran from 2014 to 2016:
–partners: UAIC-FII (coordinator), "Ștefan cel Mare" University of Suceava, Siveco Bucharest

MappingBooks: what is it about?
Creating a more intimate link between the book and its reader
•Recognise mentions of locations in text
•Crawl the web for supplementary information
•Know where the reader is
•Point out entities mentioned in the text that are in the reader's proximity
•Trace them on maps
•Mix images with generated info

Linking entities to the outside world
Linguistic Linked Open Data (LLOD): a subfield of Natural Language Processing

I like to read books and to travel…
Going out of the book…
[Figure: Google Maps printout of walking directions from Katip Çelebi Mh., Maç Sk., Beyoğlu, Turkey to Çukur Cuma Cd., Beyoğlu, Turkey: 400 m, about 4 mins, in four steps]

I need help to remember all kinship relations between characters

Characters in The Forsyte Saga
•The old Forsytes
–Ann, the eldest of the family
–Old Jolyon, the patriarch of the family, having made a fortune in tea
–James, a solicitor, married to Emily, a most tranquil woman
–Swithin, James's twin brother with aristocratic pretensions; a bachelor
–Roger, "the original Forsyte"
–Julia (Juley), a fluttery dowager; Mrs Septimus Small
–Hester, an old maid
–Nicholas, the wealthiest in the family
–Timothy, the most cautious man in England
–Susan, the married sister
•The young Forsytes
–Young Jolyon, Old Jolyon's artistic and free-thinking son, married three times
–Soames, James and Emily's son, an intense, unimaginative and possessive solicitor, married to the unhappy Irene, who later marries Young Jolyon
–Winifred, Soames's sister, one of the three daughters of James and Emily, married to the foppish and lethargic Montague Dartie
–George, Roger's son, a dyed-in-the-wool mocker
–Francie, George's sister and Roger's daughter, emancipated from God
•Their children
–June, Young Jolyon's defiant daughter from his first marriage; engaged to an architect, Philip Bosinney, who becomes Irene's lover
–Jolly, Young Jolyon's son from his second marriage; dies of enteric fever during the Boer Wars
–Holly, Young Jolyon's daughter from his second marriage, to June's governess
–Jon, Young Jolyon's son from his third marriage, to Irene, Soames's first wife
–Fleur, Soames's daughter from his second marriage, to a French Soho shopgirl Annette; Jon's lover; later marries a baronet, Michael Mont
–Val,
Winifred and Montague's son; fights in the Boer Wars; marries his cousin Holly
–Imogen, Winifred and Montague's daughter
•Others
–Parfitt, Old Jolyon's butler
–Smither, Aunts Ann, Juley and Hester's housekeeper
–Warmson, James and Emily's butler
–Bilson, Soames's housemaid
–Prosper Profond, Winifred's admirer and Annette's lover

MappingBooks
•A MappedBook is a book connected with locations/events in the virtual and real world and sensitive to the instantaneous location of a reader (as sensed by the mobile/tablet)
•The information made available may differ depending on the moment and the place of the reader

MappingBooks
•multi-dimensional mash-ups combining textual, geographical and temporal data
•adequate presentation to the reader
•links sensitive to:
–the context of the mentions in the book
–the moment the user initiates an access
–the current location of the user
•make heavy use of entity linking techniques
•spot the book mentions (persons and locations) in the real and virtual world

Aims
1) connect entities' mentions in the form of nominals (noun phrases) => one coreferential chain corresponds to each entity;
2) no preliminary records about linked entities => the knowledge base evolves from scratch;
3) look especially for coreferential relations (identity of entity mentions) and geographical relations (position, distance, point-of, near, intersects, etc.);
4) texts under investigation: geography textbooks and travel guides

MappingBooks: what is it about?
•"Understand" parts of a text
•Recognise mentions of persons and locations
•Recognise and crawl for real world entities
•Know where I am
•Sense what real world entities are in my proximity
•Trace Google Maps paths, as described in the book
•Fetch, process and make use of geo-data
•Mix images with generated info
•Display an attractive user interface
•Client-server

MappingBooks: a bird's eye view

Towards… live books
•Multidimensional artefacts that combine textual, geographical and temporal data
•Highlight mentions of persons and locations
•Links sensitive to:
–the context of the mention in the book
–the location of the reader
Maybe also to:
–the moment of reading
–the personality and preferences of the reader

Usage example
-I visit a city with the travel guide in my hand
-places of interest and routes are reordered depending on my instantaneous position

Usage example
-I am travelling by train from Brașov to Sibiu…
-if I open my tablet and put it close to the left side window of the train, I will see arrows showing and naming peaks of the Făgăraș mountains, exactly as in the geography textbook
-then, clicking an arrow, the app will fetch fragments from the book where these peaks are explained

Usage example
-I am in Paris for the 3rd time…
-but only now my MB Lonely Planet guide points me to this temporary exhibition opened in the Pyramid

MappingBooks processing of named entities
•Phase 1: pre-processing
•Phase 2: gazetteer + pattern-matching
•Phase 3: merging + validation

MB phase 1: pre-processing (PRE)
•extract text from pdf with iTEXT (https://itextpdf.com/en) => text without formatting
•correct diacritics and special characters
•eliminate end-of-line separators and other remains of the original format
•linguistic processing: tokenisation + lemmatisation, POS-tagging, NP and sentence chunking

MB phase 2: gazetteer-applier (GAZ-APP)
•Gazetteers (GAZ): lists of toponyms and other geographic names, grouped by categories, used to identify entity candidates => a document containing annotations for surface names mentioned in GAZ:
–type, subtype, coordinates, and other
related geographical data are added
–where ambiguous, a name will contain multiple tags, one for each category/subcategory, and the disambiguation process is postponed

Gazetteer
1 LOCATION (with 23 subtypes): all locations usually referenced on the map of a region: cities, ports, streets, etc.;
2 GEO POSITION (with 6 subtypes): map references (parallel, meridian, cardinal point);
3 GEOLOGY (with 6 subtypes): geological formations visible on a specific map;
4 LANDFORM (with 16 subtypes): types of physiographic formations usually indicated on maps (mountain, valley, cave, etc.);
5 CLIME (with 5 subtypes): meteorological data shown on some types of maps;
6 WATER (with 11 subtypes): variation of surface aquatic formation (river, lake, strait, etc.);
7 DIMENSION (with 9 subtypes): various ways in which geographical entities can be accompanied by (exact or approximated) values in text (height, depth, surface, etc.);
8 PERSON: names of people, accompanied by professions, where specified;
9 ORGANISATION (with 5 subtypes): military, education, etc., indicating also possible locations associated with a particular organisation type;
10 URL: web references;
11 TIMEX: dates, moments of time, intervals, etc.;
12 RESOURCE (with 4 subtypes): natural resources associated with locations;
13 INDUSTRY (with 4 subtypes): industrial areas (factories, electrical plants, etc.);
14 CULTURAL (with 6 subtypes): cultural areas (museums, parks, etc.);
15 UNKNOWN: for other geographical entities not covered by the above types

Geonames: http://www.geonames.org/
•Open resource:
–2.8 million entities, with 5.5 million alternative names
–For Romania: 25,951 names, with over 45,000 alternative names
•Names identified by letters:
–A (country, state, region, …), H (stream, lake, …), L (parks, area, …), P (city, village, …), R (road, railroad, …), S (spot, building, farm, …), T (mountain, hill, rock, …), U (undersea, …), V (forest, heath, …)
•For each of these types, besides geographical coordinates:
values for specific attributes:
•population (for P), surface (for A, H, L), height (for T), depth (for U), etc.

MB phase 2: pattern-matching (PAT-APP)
•Uses a set of patterns, described in terms of the markings left in the document by the PRE module, to discover potential geographical entities
•The difference between PAT-APP and GAZ-APP is that GAZ-APP uses strictly proper names, while the patterns also include contextual words that appear in their vicinity and are used to reduce ambiguity

PAT-APP
•Takes as input a sequence of XML elements and a GGS network and tries to find a path in the network from its starting node to its ending node

[Figure legend]
TA = Text Analytics
NER = Named Entity Recognition
EC = Entity Crawling
AR = Augmented Reality
RD = Relations Detection
DEV = Device Info
GEO = Geography
INT = Interfaces
M&T = Maps and Trajectories
RES = Resources
M&E = Management and Evaluation

Extend MB to exploit user profiling
Tell me what you read to tell you who you are…
•Data types for "stable" profiling
–age, sex, education, profession, nationality, city and region address, traveling, music, readings and other cultural preferences, hobbies, etc.
•Sources:
–connected social networks (Facebook, Twitter, LinkedIn, Google+) where MB users are members
–direct acquisition at sign up
=> Build user-specific "stable" signatures as attribute-value vectors

Instantaneous contexts (volatile data collected during daily or sporadic use, quickly changing)
•information sensed by sensors or inferred:
–the place where you are now, recent travels, the history of past travels
•information accumulated on the server:
–the book that you read now, recent readings, the history of your readings
•information entered by the user:
–intention to start a journey

Volatile profiling
•Contextual information
=> Build user-specific "volatile" signatures as attribute-value vectors
Mixed profile
•Sum up the two types of data used in profiling

Accessibility of personal data
•In
accordance with the typology of data accessibility adopted by current social media networks:
–public
–restricted to classes of friends
–for personal use only

Matching
•Apply scored vector matching techniques against other users' signatures
•Retain the best ranked matches

Networking Readers: enhance e-Books reading experience
•Use semantic and geographical links to form communities
–if the list of books read is declared visible =>
•current co-readers of B
•co-readers of B
–if instantaneous or past location is declared visible =>
•current co-proximity of L
•co-proximity of L
•current co-track of T
•co-track of T
–any combinations

Networking Readers: enhance e-Books reading experience
•Easy to imagine other ways to form communities rooted in readings
–intersect common readings and attended places with levels of friendship reported by other social media, like Facebook or Twitter
–real-world events and entities mentioned in a book associated with real-world locations and particular moments of the year/day

The syntactic layer
[Figure: processing pipeline (initial, sub-syntactic, syntactic, semantic, discourse and pragmatic processing; text in, result out). Syntactic steps: clause segmentation, syntactic parsing]
•Train a syntactic parser on a collection of syntactic trees (treebank)

Example of syntactic annotation

Syntactic parsing
PhD candidate Cătălina Mărănduc, Dr. Radu Simionescu

Syntax vs semantics
•Syntax => describes the regularities of form of a language
–an expression can be ambiguous => several interpretations
–several expressions => the same semantic representation
•Semantics => describes the meaning of expressions, their significance
–semantic representations must not give rise to multiple interpretations
–they must be independent of the syntax of the language
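To make the point about syntactic ambiguity concrete, the sketch below counts the parse trees a small grammar assigns to one sentence. It is only an illustration under stated assumptions: the toy grammar, the English example sentence and the function name `count_parses` are hypothetical and do not come from the lecture, and the counting uses a standard CYK-style chart over a grammar in Chomsky normal form, not the parser presented in the course.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (an illustrative assumption).
# Binary rules: (left child, right child) -> possible parent categories.
BINARY = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # the PP attaches to the verb phrase
    ("NP", "PP"): ["NP"],   # the PP attaches to the noun phrase
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}
# Lexical rules: word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(tokens, start="S"):
    """Count the distinct parse trees rooted in `start` that cover all tokens."""
    n = len(tokens)
    # chart[i][j][A] = number of trees with root A spanning tokens[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        for cat in LEXICON.get(tok, []):
            chart[i][i + 1][cat] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two children
                for left, lcount in list(chart[i][k].items()):
                    for right, rcount in list(chart[k][j].items()):
                        for parent in BINARY.get((left, right), []):
                            chart[i][j][parent] += lcount * rcount
    return chart[0][n].get(start, 0)


if __name__ == "__main__":
    print(count_parses("I saw the man with the telescope".split()))
    print(count_parses("I saw the man".split()))
```

Under this grammar the first sentence receives two trees (the telescope is either the instrument of seeing or something the man has), while the second receives one; a semantic representation would have to commit to a single reading.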